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Cluster sampling is common in survey practice, and the corresponding infer- 
ence has been predominantly design based. We develop a Bayesian framework 
for cluster sampling and account for the design effect in the outcome model- 
ing. We consider a two-stage cluster sampling design where the clusters are 
first selected with probability proportional to cluster size, and then units are 
randomly sampled inside selected clusters. Challenges arise when the sizes of 
the nonsampled cluster are unknown. We propose nonparametric and para- 
metric Bayesian approaches for predicting the unknown cluster sizes, with this 
inference performed simultaneously with the model for survey outcome, with 
computation performed in the open-source Bayesian inference engine Stan. 
Simulation studies show that the integrated Bayesian approach outperforms 
classical methods with efficiency gains, especially under informative cluster 
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1 | INTRODUCTION 


Cluster sampling has been widely implemented in epidemiology and public health surveys,’ but challenges arise when 
applying classical survey estimates for summaries other than population averages and totals. Bayesian modeling has 
potential advantages for small area estimation? and adjusting for many poststratification factors.* However, most of the 
work in this area has been done for one-stage sampling or ignoring clustering in the data collection. In the present paper, 
we demonstrate hierarchical Bayesian inference for two-stage probability proportional to size (PPS) sampling, with the 
understanding that other designs could be modeled in similar ways. 

Cluster sampling increases cost efficiency when partial clusters are included in the probability sampling framework. 
Bayesian cluster sampling inference is essentially the outcome prediction for nonsampled units in the sampled clus- 
ters and all units in the nonsampled clusters. The design information should be accounted for in the modeling, but 
design information for nonsampled clusters is often unknown or inaccessible. We introduce estimation strategies for 
such information and connect multilevel regression models to sampling design as a unified framework for survey 
inference. 
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We consider the design that involves first sampling primary sampling units (PSUs) and then sampling secondary sam- 
pling units (SSUs) within selected PSUs. The two-stage cluster sampling design has played an important role in the 
sequential data collection process of many big health surveys, such as the National Health Interview Survey and the Med- 
ical Expenditure Panel Survey. This design requires a complete listing of PSUs and a complete listing of SSUs only within 
selected PSUs, and thus is widely used when generating a sampling frame of every unit in the population is infeasible 
or impractical. For example, in designing a nationally representative household survey, generating a complete listing of 
every household in the country requires essentially as much effort as a complete census of all households. Instead, the 
sampling proceeds in stages, first sampling PSUs such as counties, cities, or census tracts. The PSUs are sampled with PPS, 
which is commonly the number of SSUs in the PSU but can be a more general measure of size, such as annual revenue or 
agricultural yield. The SSUs are then randomly selected within selected PSUs, often with a fixed number or proportion. 
This design assumes independence and invariance of the second-stage sampling design.* Invariance means that the sam- 
pling of SSUs is independent of which PSUs are sampled, and independence means that sampling of SSUs in one PSU is 
independent of sampling in other PSUs. For clarification, a two-phase design is one in which one or both assumptions do 
not hold. 

Our technical challenge is in performing inference for the entire population without knowing the sizes of the unsampled 
clusters. In many cases, these measures of size may not be available from the data producer for reasons of confiden- 
tiality, the proprietary nature of the data, lack of historical records for surveys done far enough in the past, or a simple 
unwillingness to share data. The usual classical survey estimates for cluster sampling do not use the population distribu- 
tion of measures of size, and this may be one reason why such information is often not available. When the population 
distribution of measure of size is available, it indeed makes sense to use this information, and then the Bayesian infer- 
ence is straightforward via multilevel modeling and poststratification. Here, we consider the more challenging case of an 
unknown distribution of cluster sizes in the population. 

Our motivating application survey, ie, the Fragile Families and Child Wellbeing study,* was collected via a multistage 
design, where two-stage cluster sampling served as a key step. The study aims to examine the conditions and capabilities 
of new unwed parents, the wellbeing of their children, and the policy and environmental effect. To obtain a nationally 
representative sample of nonmarital births in large US cities, the study sequentially sampled cities, hospitals, and births. 
The sampling of cities used a stratified random sample of all US cities with 200 000 or more people, where the stratification 
was based on policy environments and labor market conditions in the different cities. Inside each stratum, cities were 
selected with probability proportional to the city population size. In the selected cities, all hospitals in the small cities 
were included, whereas a random sample of hospitals or the hospital with the largest number of non-marital births was 
selected in large cities. Lastly, a predetermined number of births were selected inside each hospital. Classical weighting 
adjustment for the complex study design results in highly variable weights,° ranging from 0.06 to 8600, thus yielding 
unstable inferences.” Multilevel models are fitted on poststratification cells constructed by the discretization of weights’ 
and demonstrate efficiency gains of Bayesian approaches, comparing with the design-based weighting estimate, especially 
for small cell estimation. We would like to develop a Bayesian framework to account for the complex designs under 
two-stage cluster sampling. 

Our goal is to develop hierarchical models and account for design effects to yield valid and robust survey inference. 
Bayesian hierarchical models are well equipped to handle the multistage design and stabilize estimation via smoothing. 
AS an intermediate step, two-stage cluster sampling is crucial in the Fragile Families study to select cities and hospitals. 
However, cluster sampling presents unique methodology challenges as little information is available on the unselected 
clusters. This article uses the Fragile Families study as an illustration and focuses on the Bayesian cluster sampling infer- 
ence to build a unified survey inference framework. The unified framework can be extended under a complex sampling 
design, as discussed in Section 5. 

We illustrate finite population inference with the estimation of the population mean in a two-stage cluster sample. 
Specifically, we consider a population of J clusters, with each cluster j containing Nj units and a total population size of 
N= a Nj. Let J; denote the inclusion indicator for cluster j and J; |; denote the inclusion indicator for unit i in cluster 
j,i = 1, ... , Nii, where j[i] denotes the cluster to which unit i belongs. Clusters are sampled with probability proportional 
to the measure of size Mj, which is known to the analyst only for the sampled clusters. Our goal is to estimate the finite 
population mean of the survey variable y, which, for a continuous variable, is defined as 


Ly. (1) 


2|2 


J 
rD) 
j=l 


MAKELA ETAL. 


in Medicine 


where y; represents the mean ofy in cluster j. For a binary outcome, we seek to estimate the population proportion, which 
is given by 


J 
i ye (2) 


where y;) is the population total in cluster j. 

Classically, inference in survey sampling has been design based. The design-based approach treats the survey outcome y 
as fixed, with randomness arising solely from the random distribution of the inclusion indicator J. Design-based estimators 
have the advantage of being design consistent, where design consistency means that the estimator will converge to the true 
value as the population and sample sizes increase under the given sampling design. However, they are often unstable with 
large standard errors. For estimating the finite population mean of an outcome y,, the classical design-based estimator 
OH = Saal where 7; is the inclusion (selection and 
response) probability of unit i. In the two-stage sample s, when J; out of J clusters are selected with n;'s sampled SSUs, for 
convenience labeled asj = 1, ... ,Js, the estimator becomes 


Js n; 
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where 7; is the selection probability of cluster j and z;); is the selection probability of unit i in cluster j, given that cluster 
j was sampled. 

The design-based approach does not require a statistical model for the survey outcomes but implicitly assumes the spe- 
cific outcome model structure and linearity. The performance relies on the validity of the model assumptions, and then 
the estimators can yield biased inference under invalid assumptions. Another major challenge with design-based esti- 
mators comes in estimating their variance. The variance of a design-based estimator generally requires knowledge of not 
only the inclusion probability z; for a given unit i but also the joint inclusion probability 2; for any two units i and i’ in 
the population. This information is often unavailable in practice, such as the unknown measure of size for nonsampled 
clusters under the PPS setting. Joint inclusion probabilities can be challenging to compute even for straightforward sam- 
pling designs, and variance estimators for design-based estimators are often based on simplifications and approximations. 
Furthermore, weighting by inverse probability of inclusion can lead to highly noisy estimators. 

Bayesian inference, in contrast, directly models both the inclusion indicators I; and the survey outcomes y;. The Bayesian 
approach to survey inference has many advantages over the design-based approach, including the ability to handle com- 
plex design features such as multistage clustering and stratification, stabilized inference for small-sample problems, 
incorporation of prior information, and large-sample efficiency. When the design variables are included in the model, 
the selection mechanism becomes ignorable,'*" and we can model the outcomes y alone, instead of jointly modeling y 
and the inclusion indicator J. The importance of including design variables in the model has also been emphasized for 
missing data imputation.’”" 

Unfortunately, in many (arguably most) practical situations, the set of design variables is not available for the entire 
population and is instead known only for sampled clusters or units. In the case of PPS sampling, where the design variables 
consist of the cluster measures of size (M Dai we as the survey analysts may only have access to Mj (or, equivalently, 
the inclusion probability z;) for the sampled subset of J, clusters. This missing information on measures of size causes 
methodology challenge in the Bayesian setting because we cannot predict the values of y for the nonsampled clusters 
without it. We need to model the values of M; for nonsampled clusters before we are able to make inferences about y 
conditional on the design information. 

Existing Bayesian approaches to this problem’ consider the case of single-stage PPS sampling, separating inference 
for the missing measure sizes and the finite population quantities into two steps. In contrast, we propose an approach 
that integrates these steps into one model for a two-stage cluster sample. Our model allows for both cluster- and unit-level 
information to be used when both are available in certain cases. For much of this paper, we assume the measure of size 
is equal to the cluster size Nj and use Nj in place of Mj for simple illustration. 

The rest of this paper proceeds as follows. Section 2 first gives an overview of the current approaches to finite population 
inference under PPS and then describes our approach and its advantages. In Section 3, we describe a simulation study 
to investigate the performance of our method and compare with other literature methods. We apply our proposal to the 
Fragile Families study in Section 4 and discuss the conclusions and extensions in Section 5. 


for a single-stage sample s of size n is the Hajek estimator," ie, 


; (3) 


4 ol wy Bey ats MAKELA Er aL. 
1\ ICING 


2 | METHODS 


In the two-stage cluster sampling, a fixed number J; of clusters are sampled with PPS so that the probability of cluster j 
being included in the sample is proportional to N; 


Pr(Jj = 1|Nj) « Nj. 


We only observe N;,'s for the clusters in the sample, that is, the empirical distribution of (Nj|Ij = 1). Our pro- 
posed procedure simultaneously models the population cluster sizes and the outcome and propagates the estimation 
uncertainty. 

Let x; denote the auxiliary variables that are predictive for the outcome. The observed data are (ops, Xops, Nobs, 
X1:3,N,J,Js), where X,.; is the cluster-level mean of the covariate x for all clusters j = 1, ... ,J, and N, J, and J; are the 
total population size, total number of clusters, and number of sampled clusters, respectively. The subscript obs denotes 
the observed portions of the variables, ie, y,,, = (y, : i = 1,... .nyayj = 1... Js), Xons = Oi ti = 1... nay j = 

,Js), and Nops = (Nj : j = 1, ... , Js), where, for convenience, we number the sampled clusters j = 1, ... ,J; and 
the nonsampled clusters as j = J; + 1, ... ,J. We assume that x; is known for all sampled units and that x; is known for 
all clusters. If x is a demographic covariate, in practice, it is often the case that we know demographic characteristics of 
clusters even if the cluster size is unknown. 

The goal is to estimate the finite population mean y, defined for a continuous outcome, 


J J 
= Nj_ 1 : = _ = 
VS 1N! = N (y (1 Yobs, j + (N; -_ Nj )Vexe, j) oh >, Na Toss) ’ 


j=l jaS,+1 


where yp; ; is the mean of the sampled units in sampled cluster j, y,,.; is the mean of the nonsampled units in cluster j, 
and Nexc,; is the size of nonsampled cluster j that is unknown. For a binary outcome, the population proportion is 


J, J 
y= ye AT +(y (Yobs.() + Yexe,(j)) + y rat) 


= j=J,+1 


where y;) is the total of all units in cluster j, y,,.,;) is the total of sampled units in sampled cluster j, and y,,0,;) is the total 
of the binary outcome in nonsampled units in cluster j. 
We assume that the continuous survey outcome y is related to the covariate x and cluster sizes Nj; in the following way: 


Yi ~ N (Bojta + Brjta%ir Oy) . 
foi ~ N (a0 + yolog® (Nj) .o5,) 
fy~N (a: + nlog® (Nj) 3.) ° 

Ny ~ P(N; |#). : 


where ¢ are the parameters governing the distribution of the cluster sizes N;. We are using cluster size in the model, not 
so much because we expect it will necessarily be a good predictor but because not including cluster size can induce bias 
if cluster size is correlated with the outcome of interest. 

The model allows the regression coefficients to vary by cluster and depend on the cluster sizes. We use a multilevel 
model to borrow information across clusters. Classical regression with cluster membership indicators could also be used 
to quantify the cluster effect but at the cost of increasing the variance, as has been shown in the context of missing 
data imputation."*’° In addition, if the goal is to make predictions for unobserved clusters, some model for these cluster 
parameters is needed. 

Our model for a binary outcome is as shown earlier, except that we modify (4) to be 


Pr(yi = 1) = logit”! (Bo,ti)), (8) 


and omit (6). We exclude unit-level covariates in the binary case because the nonlinearity of the logistic function makes 
it challenging to make use of data at the unit level. Specifically, predicting y,,,; requires knowing x; for all nonsampled 
units in cluster j, and if we knew this, clearly we would have also known N; for nonsampled clusters j. 

We use the centered logarithms of the cluster sizes log*(N;) as predictors; we work on the logarithm scale to better 
accommodate large cluster sizes and center for interpretation convenience. The sampling is assumed to be ignorable after 
including the design variables in the outcome model. 
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We assign an estimation model p(N; | ¢) to the cluster sizes, which we observe only for the sampled clusters. We develop 
both nonparametric and parametric modeling strategies to predict the cluster sizes of nonsampled clusters. 

We use y to denote the regression parameters yw = (a0, 79, 60; 41,71, 01, Oy), associated with the outcome modeling and 
6 for all parameters of interest, ie, 9 = (yw, d). The likelihood for the observed data is 


P(Y obs» Nobs | Xops, 9) « P(Yobs | Xobss-Nobss w)P(Nobs |), 


and the posterior distribution is 


po | Yobs» Xobss Nobs) & P(obs | Xopss-Nobss w)P(Nobs | d)p(y)p(P), 


where we assume that y and ¢ are independent, allowing us to write p(0) = p(w)p(¢). 

Because of the independence, invariance, and ignorability assumptions in the two-stage cluster sampling, the distribu- 
tion of the outcome y, given the design variables, is the same in the sample and the population; that is, the observed data 
likelihood is the same as the complete data likelihood 


P(Yobs | Xops, Nobss W) = p(y|x,N,I =lLy) = p(y |x, N,y). 


Here, p(y | x, N, y) is specified by (4)-(6) for continuous y and by (8) and (5) for binary y. 
The challenge lies in estimating the distribution of the N;'s when the sampling is informative. Under PPS sampling, the 
probability of observing a cluster of size Nj is 


P(N; |Z) = 1) « Pry = 1| N,)p(Nj) 
x Njp(N;), (9) 


where the population size N is fixed. Next, we release the assumption of fixed N and consider both nonparametric 
and parametric modeling strategies for the prior distribution p(Nj) (also called the population distribution, to distin- 
guish from the distributions of sampled and nonsampled cluster sizes) in (7). First, we introduce the Bayesian bootstrap 
algorithm in Section 2.1 as a nonparametric approach to predicting the unobserved Nj's. Second, we investigate two 
parametric distributional assumptions in Section 2.2 for p(Nj), ie, the negative binomial and lognormal distributions. 
Here, our goal is to directly model the distribution of the cluster sizes accounting for the fact that the observed dis- 
tribution is biased from the complete population distribution. We refer to these parametric choices as size-biased 
distributions.” 

We apply the Monte Carlo approximation by screening out the posterior samples of nonsampled units (Nj | J; = 0) and 
keeping only the cases with DF: L, Nj =N-D A gai’, ;. For implementation, we screen the predicted values to get 20% 
of the posterior samples in which the total of the nonsampled cluster sizes is closest to what it should be.'* 


2.1 | Bayesian bootstrap 


For a nonparametric model of the sampled cluster sizes, we modify the Bayesian bootstrap algorithm for one-stage 
PPS sampling'*'® to be adapted for the two-stage PPS sampling. Without a parametric assumption for p(Nj), we con- 
nect p(N;j|Jj = 0) with p(Nj|J; = 1) through the empirical distributions under PPS sampling. Assume the N;'s 
observed for the sampled clusters have B unique values N*, ... ,N; and let ki, ... ,kg be the corresponding counts of 
these unique sizes, such that }', ky = Js. Let y,, denote the probability of observing a cluster of size N ;, in the sample, ie, 
wp = Pr(N; = NE |I; = 1). We can then model the counts k = (kj, ... ,kg) as multinomially distributed with total J, and 
parameters yw = (wy, ... , Wg). The observed likelihood Loys(w) is 


I J, B 
(fs = VIN ENT) santa DIN) =HNF) GS 17S th) «Ty. 
j=l j=l b=1 

where I(-) is an indicator function and I(-) = 1 if the inside expression is true and 0 otherwise. The y's are given a nonin- 
formative Haldane prior, ie, p(yy, ...,.W,) = Dirichlet(O, ... ,0), a conjugate Dirichlet prior distribution. The posterior 
distribution of y is then 


PMN, oneg UB | ky, aoe , Kp) = Dirichlet(k,, aoe , Kp). 


Suppose the unique values of N;'s cover all possible values in the population. Assume ky is the number of nonsampled 
clusters with size NF; for b = 1, ... ,B, and let We denote the probability of an unobserved cluster having size N;: We = 
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Pr(N; = N : |I; = 0). Then, the counts of the B unique sizes among the nonsampled clusters, ie, (KE cas aka) follow a 

multinomial distribution with total J —J, = Lok and probabilities (yw, ... Wp) 


B 

xk* 

PU, KE IT—Jo wt, vp) xT], 
b=1 


Using Bayes’ rule, we can write y* as 
Pri; =0|N; = Nz) 
x Pr(N; = N; |) = 1)>———__—__ 
Pr(J; = 1|N; = Nz) 


1—Zp 
=Wb > 
Ib 


where zp = Pr; = 1|N; = N;) = JsN;/N is the conditional cluster selection probability known in the PPS sample, 
J; is the number of sampled clusters, and N is the population size. This approach essentially adjusts the probability of 
resampling an observed size N;* by the odds ofa cluster of that size not being sampled, so that smaller sizes are upweighted 
relative to larger ones. 

Given the posterior draws of Ww ‘sand ky 's, we create k; replicates of size N*, yielding a sample of the nonsampled cluster 
sizes from their posterior predictive distribution. The Bayesian bootstrap for cluster sampling is similar to the “two-stage 
Polya posterior" approach,’* which simulates draws that form a population of clusters and then an entire population of 
elements within each cluster. Survey weights are incorporated in Bayesian bootstrap for multiple imputation in two-stage 
cluster samples.”° A similar approach is used’ to estimate the poststrafication cell sizes constructed by the survey weights. 

The Bayesian bootstrap avoids parametric assumption on the population distribution p(N;) and use the empirical dis- 
tribution in the observed clusters. This implicitly introduces a noninformative prior distribution on N;'s. However, this 
approach restricts the draws for the nonsampled cluster sizes to come from the set of observed cluster sizes, where 
small clusters may be omitted under PPS sampling. While the Bayesian bootstrap is a robust algorithm for predicting the 
unknown N's, we can achieve efficiency gains with a parametric distribution on p(Nj), especially when prior distribution 
information is available. 


2.2 | Size-biased distributions 


Inducing parametric sized-biased distributions follows the superpopulation concept in the model-based survey inference 
literature and incorporates informative prior information. In practice, we may have some knowledge about the cluster 
sizes, such as the distribution in a similar population or from previous years. We can incorporate this additional infor- 
mation through the prior distribution specification to calibrate and improve the inference.”'” Sized-biased distributions 
were proposed for population size estimation.'’ We consider a discrete and a continuous distribution as candidates for 
modeling the size distributions. The observed likelihood is connected with the proposed population distribution via (9). 
Using the PPS sample, we can estimate the parameters in the population distribution and then predict the nonsampled 
cluster sizes. 

For the discrete case, we assume that the population cluster sizes N; follow a negative binomial distribution, ie, Nj ~ 
NegBin(k, p), with k > Oandp € (0,1). By normalizing the distribution in (9) and completing the algebra shown as 
follows, we see that the sizes in the PPS sample can be written as N; = 1 + Wj, where W; ~ NegBin(k + 1, p). 

For m = 0,1,2, ..., the probability of observing Nj; = m in the PPS sample is 


Prd; = 1|N; =m) PriNj =m 
nMenipens = ee 


m-1 


2 a a ae 


= Pr(W =m-1), 


where W ~ NegBin(k + 1, p). 
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For the continuous case, we use the lognormal distribution. If the population distribution is Nj ~ lognormal(y, 7’), then 
(Nj | Jj = 1) ~ lognormal(y + t*, 77). To see this, recall that P(N;) denotes the pdf of size variables N; in the population. 
Then, the pdf of Nj in the PPS sample is 


Prd; = 1| N,)p(Nj) 
Pri; = 1) 


-1 
_ (log Nj-H)? ) 
( 2at ) exp ( ae 


=1 = 
Al 2nt ) exp (— SS") dy, 


_ 2 
exp (-“ H) ) 
ee Toe : (10) 
A exp (-“") dN; 


21? 


P(N; |Jj =1) = 


We can now simplify the denominator 


co log N; — 1)? 2 
| exp _GogN;= 0" dN; = V2at exp | w+ FA, (11) 
0 212 2 


Now, substitute (11) for the denominator in (10) 


log N; — py? 2 
PON My = Y= exp (-SERE IM (u )) 


T 
Wnt 212 > 


iT (log N; ~(u+72)) 
= —— sp | <b 
N,V 2at 2t 


Thus, the distribution of sampled cluster sizes in the PPS sample is (N; | Jj = 1) ~ lognormal(u + T?,T). 

Regardless of the parametric model we choose, in order to generate predictions of the nonsampled cluster sizes, we 
need to draw from p(N; | I; = 0). We apply rejection sampling and use samples from p(Nj;) to approximate the sampling 
from p(N; | J; = 0) 


Prd; = O|N,)p(N;) 4 


P(N; | J; = 0) = Prd, = 0) = Gp(Nj), 


where G = Pr(I; = 0| N;)/ Pr(; = 0) has a constant upper bound shown as follows. The marginal probability selection 
for cluster jis Prdj = 1) = J;/J, and the joint distribution of (Nj, J;) in the PPS sample is p(Nj, Jj; = 1) = cNj,p(Nj), where 
cis aconstant. Furthermore, we have 


Prd; = 1) = [peut = 1)dp(Nj) = | cN;p(N;)dp(N;) = c E(N;). 
N, N, 


Hence, c = J;/(JE(Nj)). Then, 


JN; 


1— Prd; =1|Nj;) 1 Ew) 
~ 1-Prj;=1 °° 1-4 fd 


Assume E(Nj) = N/J, approximated by the finite sample average cluster size,’* such that 
JN; 
—— 
G= ae 
1-Js/J ~ J-Js 


Given the posterior distribution of p(N; | —), we use rejection sampling to obtain posterior predictive samples from 
P(N; | Jj = 9, -). 
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2.3 | Prior specification and computation 


We use the following weakly informative prior distributions”: 


ind 
a, Yo, 1,71 ~ N(0, 10) 


ind 
O4,,05,,0 ~ Cauchy*(0, 2.5). 


Here, Cauchy* (0, 2.5) denotes a Cauchy distribution with location 0 and scale 2.5 restricted to positive values. The weakly 
informative prior specification will allow the group-level variance parameters to be close to 0 and have large tail values. 

For the parameters governing the distribution of Nj, such as (k, p) in the negative binomial distribution or (yu, 7) in the 
lognormal distribution, we can use noninformative priors when the number of clusters sampled is large. However, when 
only a few clusters are sampled, we need informative prior information to counteract the sparsity of the data and stabi- 
lize the inference. This is particularly true when using a model for the cluster sizes that includes implicit assumptions 
about the data. For example, as an overdispersed extension of the Poisson distribution, the negative binomial distribution 
assumes that the data come from a distribution whose mean is smaller than the variance. However, in a sample of only 
five clusters, it may well be that the sample mean is larger than the sample variance, making it difficult to fit the negative 
binomial distribution to the data without strong prior information. In this case, we reparameterize the negative binomial 
as a Gamma mixture of Poisson distributions and place a prior on the coefficient of variation (CV), ie, the standard devi- 
ation divided by the mean. In this case, the CV works out to the reciprocal of the square root of the scale parameter of the 
Gamma distribution. With a small number of clusters, we expect the CV to be close to one and therefore use an exponen- 
tial prior distribution with rate 1. For the lognormal distribution, we place a Cauchy* (0, 2.5) prior on the scale parameter 
t. To aid estimation for the case with only a few sampled clusters, we standardize the log of the sampled cluster sizes by 
subtracting their mean and dividing by the standard deviation. 

For the continuous outcome, in nonsampled clusters j, the posterior predictive distribution for Y,,. ; is 


Vere | ~ N (Boj + BrjXj,05/N)) ; 


where we assume x; is known. Specifically, we draw new values of fj, 61 ;, oy, and Nj from their posterior distributions 
and then draw y,,, from the aforementioned distribution. In sampled clusters, the posterior predictive distribution for 
the nonsampled units is 


Vere 1+ ~ N (Boj + B1jX;.05/(Nj — nj). 


When Ny; is large compared to nj, as is the case in many large-scale surveys and specifically in the Fragile Families study, 
Yexe,; 1S Close to the cluster mean y, and is well approximated by fo; + f1;x;, which we calculate using the posterior means 
of Bo; and f,;. 

The posterior computation is implemented in Stan, which conducts full Bayesian inference and generates the pos- 
terior samples. The estimation for the outcome model and the cluster size model can be integrated into the posterior 
computation, which allows for uncertainty propagation throughout the parameter estimates, in contrast to previous 
approaches.'*!8 

Stan is unique in providing detailed warnings and diagnostics to inform the user when posterior inferences may be 
unreliable due to difficulties in sampling. Divergent transitions indicate that the sampler is unable to explore a portion of 
the parameter space, which can lead to significant bias in the resulting posterior distribution and ultimately unreliable 
inferences.” Stan reports the number of divergent transitions for each run, and even one divergent transition indicates that 
the results may be suspect. If divergent transitions occur, we follow the recommendation of Stan developers and iteratively 
increase the target acceptance rate adapt_delta.” If divergent transitions occur even with adapt_delta= 0.99999, 
we switch to the noncentral parameterization and follow the same procedure for increasing adapt delta as necessary. The 
noncentral parameterization is a mathematically equivalent formulation for the model that can avoid posterior geometries 
that are difficult for HMC to explore.”>’ 

To understand the importance of explicitly controlling for all design variables in this context, we also fit a model similar 
to (4)-(7) but with vy, and y, set to 0. Such a model accounts for the hierarchical cluster nature of the data by allowing fy 
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and f, to vary by cluster but does not account for the PPS sampling design because the cluster sizes N; are excluded from 
the model 


yi ~ N (Boyt + Prjta%i, oF) (if continuous) 
Pr(yj = 1) = logit” (Bojtiy) (if binary) 
Pa ees (%. °%,) (12) 


fy ~N (a0 ). 


3 | SIMULATION STUDY 


We perform a simulation study to compare the performance of our integrated approaches with classical design-based 
estimators on the statistical validity of the finite population inference. We generate a population from which we take 
repeated two-stage cluster samples under PPS and use each of the methods to estimate y. The population consists of 
J = 100 clusters, with cluster sizes N; drawn from one of two distributions. The first is a Poisson distribution with rate 500. 
The second is a multinomial distribution over scaled Gamma-distributed sizes. Specifically, we draw J = 100 candidate 
cluster sizes N; as N; = 100G;, where G; ~ Gamma(10, 1). We then take a multinomial draw from these 100 unique sizes, 
with the J-vector of probabilities drawn from a Dirichlet distribution with concentration parameter 10, which disperses 
probability mass equally across the J = 100 components. In both cases, to avoid clusters that would be selected with 
probability 1, we resample the J cluster sizes until none are so large to be selected with certainty. 
For continuous outcome, we simulate the population unit value y, from the following: 


yi ~ N (Bost) + Bra%i, o5) 
Boj ~ N (a0 + yolog’(Nj), o3,) 
fay ~N (a: + ylog*(Nj), 3) a) 


a, 4, 0,71 ~ N(O, 1) 
oj,,0p, ~ N*(0,0.5") 
oy ~ N*(0,0.75”), 


where N*(, 07) denotes the positive part of the normal distribution with mean y and standard deviation o. The model 
for binary y is identical, except that the first line of (13) is replaced with y; ~ Bernoulli(logit™! (fo j{i})) (and we omit f)). 

We use the same outcome model for data generation and estimation to focus on the performance evaluation of different 
approaches accounting for the design effect and avoid potential model misspecification. In practice, the outcome model 
can be adapted with flexible choices, as discussed in Section 5. We recommend that model diagnostics and evaluation are 
necessary. In Stan, we have implemented model comparisons such as the leave-one-out prediction error,** which can be 
straightforwardly applied in practice. We generate x; by sampling from the discrete uniform distribution between 20 and 
45 and center it by subtracting the mean. We assume that x; is known for all sampled units, and that x; is known for all 
clusters. The cluster sizes N;'s are only known in the sampled clusters. 

We sample J; < J clusters using random systematic PPS sampling with probability proportional to the cluster size 
Nj and n; units via simple random sampling (SRS) in each selected cluster j. We consider values of J; € {10,50} and 
nj € {0.1Nj,0.5Nj, 10,50}. When nj € {10,50}, the sample is self-weighting, meaning each unit has an equal probability 
of selection. To see this, recall that the probability of sampling cluster j is z; « Nj. Since within-cluster sampling is done 
with SRS, the probability of sampling unit i given cluster j is selected is 2;\; = nj/Nj; = n/N; when n;, is the same for 
all clusters. The marginal probability of sampling unit iis therefore 7; = ajaj\; « Nj-(n/Nj) = n, which is constant 
across units and clusters. Even though the final weights are constant, our studies show that the design features should be 
accounted in the outcome model. 

For each combination of J; and nj, we draw 100 two-stage samples from the population. For each sample, we estimate 
the finite population mean using the methods described as follows. 


* negbin: The negative binomial size-biased distribution as described in Section 2; 
* lognormal: The lognormal size-biased distribution as described in Section 2; 
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* bb: The Bayesian bootstrap as described in Section 2; 

* Hajek: The Hajek estimator in (3); 

* greg: The generalized regression estimator,”” which leverages a unit-level covariate to improve prediction. We only 
use this estimator for continuous y. We use the derived formulas‘ to estimate the variances of the Hajek and generalized 
regression estimators’; 

* cluster _inds: The model in (12), which accounts for the hierarchical nature of the data via random cluster effects 
but does not use the cluster sizes as a cluster-level predictor in modeling fo, and f,;; 

¢ knowsizes: The model in (4)-(6), where we additionally assume the cluster sizes are known for the entire population. 
This is the best scenario and will serve as a benchmark for the other Bayesian methods. 


There are three main comparisons that we make in evaluating the results of the simulation study. First, we measure 
the performance of our proposed integrated Bayesian approach against that of the classical design-based estimators; we 
do this by comparing the performance of negbin, lognormal, and bb to that of Hajek and greg. Second, among 
the Bayesian methods, we want to understand when the parametric models negbin and lognormal outperform the 
nonparametric Bayesian bootstrap bb. Third, we compare the performances of cluster_inds and knowsizes to 
understand the importance of explicitly including cluster sizes as cluster-level predictors in (5) and (6). In this case, we 
assume that cluster sizes are known for all clusters in the population and focus on the effects of incorrectly excluding or 
including the cluster sizes as cluster-level predictors in the model. 

We carefully monitor the diagnostics of computation performance for each drawn sample. If divergent transitions 
remain, we discard the sample. We monitor the estimated potential scale reduction factor R for each parameter. This diag- 
nostic assesses the mixing of the chains; at convergence, R=1.1fR > 1.1 for any parameter, we increase the number of 
iterations by 1000 until all values of Rare less than 1.1, up to 4000 iterations. If values of R > 1.1 remain with 4000 itera- 
tions, we discard the drawn sample. The results presented here are based on a minimum of 85 simulation draws for each 
combination of number of clusters sampled and number of units sampled. That is, we repeatedly draw 100 samples from 
the population and keep the L cases with good computation performance, ie, 85 < L < 100. 

The results of the simulation study are in Figures 1 to 4, with each figure displaying a different combination of outcome 
type (continuous or binary) and population cluster size model (Poisson or multinomial). In each figure, there are six 
panels displaying the six metrics with which we evaluate the methods, ie, relative bias, relative root mean squared error 
(RRMSE), coverage of 50% and 95% uncertainty intervals, and the average relative widths of the 50% and 95% uncertainty 


intervals. The relative bias is calculated as pie oe , where y is the true population mean, y, is the estimated value from 


the 7th simulation, and L is the number of simulations. The RRMSE is calculated as 1 / ple (= ‘ For the Bayesian 
methods negbin, lognormal, bb, cluster_inds, and knowsizes, the 50% (95%) intervals are calculated from the 
25th and 75th (2.5th and 97.5th) percentiles of the posterior predictive distribution for y. For the classical methods, we rely 
on asymptotic normal theory and the variance estimators.’ The relative widths of the uncertainty intervals are calculated 
by dividing the width of the uncertainty interval by the true y and averaging across the L simulations. 

In each plot, the x-axis is the metric value and the y-axis denotes different models. The panels represent the different 
within-cluster sampling schemes. The top two plots are for the fixed-percentage schemes, where nj = pN;j for p = 0.1 
and p = 0.5,j = 1,... ,Js. The bottom two plots represent the self-weighting samples, with nj = 10 and nj = 50, 
j = 1,... Js. The colors of the circles represent different first-stage sample sizes Js, Js € {10,50}. 

We now describe the results for each of these three comparisons for the four combinations of the outcome type 
(continuous and binary) and the population cluster size model (Poisson and multinomial distributions), as explained in 
the previous section. 

Bayesian methods generally yield more efficient inference than classical estimators, particularly with small number 
of clusters. For continuous y, the Bayesian models outperform the design-based estimators, both for the Poisson and the 
multinomially distributed population cluster sizes in Figures 1 and 2, respectively. The differences are rather small when 
J; = 50but pronounced when J, = 10. The Hajek estimator has large bias, particularly when the sample is self-weighting, 
but including auxiliary information as the GREG estimator does greatly reduces the bias. Still, the classical estimators 


*In some cases, the sample size is so large as to make calculating the design-based variance under a nonself-weighting design difficult. This is due to 


the A;, term in the related equations* (Equations 6.3 and 9.27), which requires generating an n x n matrix, where n = Ye ,;- When J, = 50 and 
nj = 0.5Nj, n can easily be 20000 or larger, making the matrix prohibitively large to compute. In these cases, we estimate the variance by randomly 
selecting 100 units via SRS in each sampled cluster and using those units to compute the required matrix. 
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FIGURE1 Results for continuous y with cluster sizes N; drawn from a Poisson distribution. The top two plots are for fixed-percentage 


simple random sampling (SRS) schemes, and the bottom two are for fixed-number SRS samples. Hajek: the Hajek estimator; greg: 


generalized regression estimator; bb: Bayesian bootstrap; negbin: negative binomial distribution; lognormal: lognormal distribution; 


cluster_inds: the model with random cluster effects but without the cluster size predictor; knowsizes: the model with known 


population cluster sizes. RMSE, root mean squared error [Colour figure can be viewed at wileyonlinelibrary.com] 


yield unstable results, evident in the high RRMSEs. The Bayesian estimators are preferable here with lower bias and 
RRMSE and yield short uncertainty intervals whose coverage rates are close to or above the nominal level. For binary y, 
there is little difference between the Bayesian methods and the Hajek estimator when the number of sampled clusters is 
large, ie, J; = 50. This holds for both the Poisson-distributed cluster sizes in Figure 3 and the multinomially distributed 
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FIGURE 2 Results for continuous y with cluster sizes N; drawn from a multinomial distribution. The top two plots are for fixed-percentage 


simple random sampling (SRS) schemes, and the bottom two are for fixed-number SRS samples. Haj ek: the Hajek estimator; greg: 


generalized regression estimator; bb: Bayesian bootstrap; negbin: negative binomial distribution; lognormal: lognormal distribution; 


cluster_inds: the model with random cluster effects but without the cluster size predictor; knowsizes: the model with known 


population cluster sizes. RMSE, root mean squared error [Colour figure can be viewed at wileyonlinelibrary.com] 


cluster sizes in Figure 4. When the number of sampled clusters is small, the Hajek estimator and the Bayesian methods 
have comparable bias and RRMSE. However, the coverage rates for the Hajek estimator are often below the nominal level, 
particularly when the sample is not self-weighting (top row of plots). 
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FIGURE3_ Results for binary y with cluster sizes Nj drawn from a Poisson distribution. The top two plots are for fixed-percentage simle 
random sampling (SRS) schemes, and the bottom two are for fixed-number SRS samples. Haj ek: the Hajek estimator; bb: Bayesian 
bootstrap; negbin: negative binomial distribution; lognormal: lognormal distribution; cluster_inds: the model with random cluster 
effects but without the cluster size predictor; knowsizes: the model with known population cluster sizes. RMSE, root mean squared error 
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We compare the parametric and nonparametric approaches by the predictive performance of N;'s from nonsampled 
clusters and the inference on finite population mean. We collect the posterior mean estimation of the predicted N;'s and 
compare the density with that for the true N;'s. Bayesian bootstrap is robust by yielding the density estimation of the pre- 


Relative bias 
10 (%) 50 (%) 
hajeky ° qo 
bb; 0 ro) ° ° 
negbin 4 ° ° -) 
lognormal ° ° ° jo 
cluster_inds4 ° ° ° ° 
knowsizes + ° ° ° ° 
10 (n) 50 (n) 
hajek- q ° lo 
bb} ° Oo ° r) 
negbin + ° | ° ° 
lognormal + ° ° ° 
cluster_inds{ ° ° ° ° 
knowsizes + °° ° Q 


MAKELA ETAL. 


in Medicine 


-0.08 -0.04 0.00 


-0.08 -0.04 0.00 


Number of sampled clusters © 10 © 50 


Coverage of 50% UI 


10 (%) 50 (%) 
hajeky eo} 0° 9 0 
bb ° ° -) 
negbin + ° ° ° 
lognormal o 0° co 
cluster_inds{ a) ° ° 
knowsizes }© ° ° o 
10 (n) 50 (n) 
hajek + °° o oO 
bb co ° ° 
negbin + ° ° ° ° 
lognormal oo ° ° 
cluster_inds{ ° ° ° ° 
knowsizes + oo co 


0.350.400.450.500.550.600.350.400.450.500.550.60 


Number of sampled clusters © 10 © 50 


Relative length of 50% UI 


10 (%) 50 (%) 
hajek7 © ° I] ° ° 
bb7° ° ° ° 
negbin; 0° ° | ° ° 
lognormal+ © ° ° ° 
cluster_inds|° ° ° ° 
knowsizes 7 © ° | ° ° 
10 (n) 50 (n) 
hajek; 0° ° I] ° ° 
bb; 9° ° ° ° 
negbin; © ° | ° ° 
lognormal; © ° ° ° 
cluster_inds; 0 ° ° ° 
knowsizes; © ° | ° ° 
01 02 03 04 0501 02 03 04 05 


Number of sampled clusters © 10 © 50 


Relative RMSE 


10 (%) 50 (%) 
hajek + ° ° ° ° 
bb/ ° ° ° ° 
negbin 4 ° ° ° ° 
lognormal + ° ° ° ° 
cluster_inds4 ° ° ° ° 
knowsizes + ° ° ° ° 
10 (n) 50 (n) 
hajek+ ° ° ° ° 
bb/ ° ° ° ° 
negbin 4 ° ° ° ° 
lognormal + ° ° ° ° 
cluster_inds{ ° ° ° ° 
knowsizes + ° ° ° ° 


0.00 0.05 0.100.15 0.20 0.2800 0.050.100.150.200.25 


Number of sampled clusters © 10 © 50 


Coverage of 95% UI 


10 (%) 50 (%) 
hajek; 9° ° ° ° 
bb} 0° ° oO 
negbin 4 o ° ° 
lognormal; 9 o}|o ° 
cluster_inds4 ® 06 
knowsizes} © ° 0 
10 (n) 50 (n) 
hajek; © ° 9 0 
bb; 0° °o oO 
negbin 4 ° ° ° ° 
lognormal, 0° o 0 
cluster_inds4 90 60 
knowsizes + o/;o ° ° 


0.878.900.928.950.979 .00879).900.929.950.974.000 
Number of sampled clusters © 10 © 50 


Relative length of 95% UI 


10 (%) 50 (%) 
hajek; © ° ° ° 
bb; 9° ° ° ° 
negbin; ° ° ° ° 
lognormal; © ° ° ° 
cluster_inds;° ° ° ° 
knowsizes 7 © ° ° ° | 
10 (n) 50 (n) 
hajek; 0 ° ° ° 
bb; 0 fo || o ° 
negbin; 0 o|;o ° 
lognormal + ° o|; ° ° 
cluster_inds} 0 ¢ ° ° 
knowsizes; © 9 ° ° | 


02 04 06 08 1.00.2 04 06 08 1.0 


Number of sampled clusters © 10 © 50 


FIGURE4 Results for binary y with cluster sizes Nj drawn from a multinomial distribution. The top two plots are for fixed-percentage 

simple random sampling (SRS) schemes, and the bottom two are for fixed-number SRS samples. Haj ek: the Hajek estimator; bb: Bayesian 
bootstrap; negbin: negative binomial distribution; lognormal: lognormal distribution; cluster_inds: the model with random cluster 
effects but without the cluster size predictor; knowsizes: the model with known population cluster sizes. RMSE, root mean squared error 
[Colour figure can be viewed at wileyonlinelibrary.com] 


dicted N's that is close to the truth but can be off on the tails because it only uses the observed sizes. The two parametric 
approaches are sensitive about model assumptions. When the number of selected cluster is large, the three approaches 
tend to perform similarly. Both the parametric and nonparametric approaches are statistically valid and have competitive 
performances. For continuous y, the parametric models negbin and lognormal perform comparably to the nonpara- 
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metric bb with unbiased estimates and similar RRMSEs in Figures 1 and 2 particularly under large J;, whereas coverage 
is generally higher for the parametric models in Figure 2. For binary y, with Poission-distributed cluster sizes, the para- 
metric models have a bit higher bias in Figure 3, ranging around 1-1.5%, whereas, for multinomially distributed cluster 
sizes in Figure 4, the parametric models are less biased than the nonparametric one, especially when the sample is not 
self-weighting and the number of clusters is small. Coverage rates vary but are most consistently around or above the 
nominal level, both for the parametric and nonparametric methods. For both continuous and binary y, there is a little 
difference in RRMSEs and uncertainty interval lengths between the parametric and nonparametric methods. 

Incorrectly omitting cluster sizes as cluster-level predictors, that is, using cluster_inds instead of knowsizes, has 
small impact when y is continuous for either the Poisson or the multinomially distributed population cluster sizes. The 
bias, RRMSE, and coverage rates for the two methods are similar in both Figures 1 and 2, even though knowsizes 
has subtle improvement. The differences between cluster _inds and knowsizes are minor for binary y as well; 
cluster_inds does not perform appreciably worse than knowsizes in either Figures 3 or 4, the Poisson, or the multi- 
nomially distributed population cluster sizes. This is because the coefficients y, of log“(N;) in the simulation are small 
(-0.341 in the Poisson case and 0.097 in the multinomial case). If y and Nj are unrelated, it is not necessary to include N; 
in the model, even under PPS sampling; allowing the regression parameters to vary by cluster as in cluster_inds is 
sufficient for valid inference. In the application study of Section 4, we find that including the cluster sizes as cluster-level 
predictors will substantially reduce bias and RRMSE with continuous outcome when the correlation between y and N; is 
large (y; = 1.81). However, the resulted difference is negligible under binary outcome comparing to the approach only 
including cluster indicators as random effects models. It is pivotal to account for the two-stage structure comparing to the 
PPS design. This shows that, when the sampling design is complex, including two-stage sampling, cluster sampling, PPS, 
and SRS, some design feature could play a bigger role than others. We recommend controlling for all the design features 
if possible. 


4 | FRAGILE FAMILIES STUDY APPLICATION 


To evaluate the performance of our method in a more realistic survey context, we use a modified version of the Fragile 
Families study design in conjunction with a presumed outcome model to implement the finite population inference. We 
would like to use the Fragile Families sampling frame to illustrate the benefits of Bayesian models accounting for the 
design features. For convenience, we use the outcome estimation model that is the same as the generation model, which 
assumption can be released as future extensions. 

The Fragile Families study° divided the 77 US cities with 1994 populations of 200 000 or greater into nine strata based 
on their policy environments and labor markets. Eight of the strata were for cities with extreme values in at least one of 
the three policy dimensions under consideration (labor markets, child support enforcement, and welfare generosity), and 
the ninth stratum was for cities that had no extreme values. One city was selected with PPS in each of the eight extreme 
strata, with a target sample size of 325 births in each city. In the last stratum, eight cities were selected via PPS, with 
a target sample size of 100 births in each. There was an intermediate stage of selecting hospitals, which we ignore for 
the paper illustration. We use the Fragile Families study's city population of 77 cities in 1994 as the sampling frame and 
implement two-stage cluster sampling under PPS. 

As asimulation, we use the city population (divided by 100 for computational convenience) as both the measure of size 
Mj and the number of units in the cluster Nj, though the ultimate unit of sampling in the study was births and number of 
births in cities should be accounted for. We exclude the three cities that would be selected with probability one for a total 
of J = 74 cities. For each unit in the population, we generate an outcome y according to our model in (13). While the 
original Fragile Families sampling design involves nine strata, we combine them into a single stratum. As in the actual 
study design, we sample 16 cities with probability proportional to the city population. In each sampled city, we sample 
either 325 or 100 births, depending on whether the city is a large- or small-sample city,* which results in a self-weighting 
sample. 

Figures 5 and 6 show the outputs for when the outcome is continuous and binary, respectively, in terms of relative bias, 
RRMSE, coverage rates, and relative widths of 50% and 95% uncertainty intervals. The main findings are consistent with 
the simulation studies. 

For continuous y in Figure 5, the Bayesian methods (with the exception of negbin) outperform the design-based 
estimators in terms of RRMSE and uncertainty interval width and are comparable on bias and coverage. The Bayesian 
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FIGURE5 Results for continuous y with cluster sizes N; in the Fragile Families study design. Haj ek: the Hajek estimator; greg: 
generalized regression estimator; bb: Bayesian bootstrap; negbin: negative binomial distribution; lognormal: lognormal distribution; 
cluster_inds: the model with random cluster effects but without the cluster size predictor; knowsizes: the model with known 
population cluster sizes. RMSE, root mean squared error 


methods yield uncertainty intervals that are less than half the width of those based on the design-based methods, with 
coverage rate that is close to the nominal level. Among the three Bayesian methods, bb and lognormal perform simi- 
larly, and both are better than the negbin assumption. The negative binomial population distribution performs poorly 
with large bias and RRMSE but low coverage rate. Excluding cluster sizes leads to worse performance, with higher bias 
and RRMSE and longer uncertainty intervals for cluster_inds compared to knowsizes. When the outcome and 
cluster sizes are highly correlated, including cluster sizes improves the prediction, which can be enhanced by the known 
mean values of auxiliary variables for nonsampled clusters. 

When yis binary as in Figure 6, we again see that the Bayesian methods yield better results in terms of bias, RRMSE, and 
coverage than the classical Hajek estimator. The uncertainty intervals of the Hajek estimator are the shortest but are close 
to those from the Bayesian methods. Comparing the parametric and nonparametric models, lognormal is unbiased, 
however, bb and negbin generate biased estimates. The three models have comparable RRMSEs and uncertainty interval 
lengths, with conservative coverage rates above or equal to the nominal levels. The effects of excluding the cluster sizes 
are small, with cluster_inds having only slightly larger bias and RRMSE, and lower coverage rates than knowsizes. 

To further investigate the population distribution of cluster sizes, Figure 7 shows the density plots for 100 cluster sizes 
drawn from the assumed Poisson and multinomial distributions and the 74 (noncertainty for selection) Fragile Fami- 
lies city populations. From the plots, both the Poisson distribution and the multinomial/Gamma distribution used in the 
simulation study are different from the population distribution of cluster sizes in the Fragile Families study. The clus- 
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FIGURE6_ Results for binary y with cluster sizes Nj in the Fragile Families study design. Haj ek: the Hajek estimator; bb: Bayesian 
bootstrap; negbin: negative binomial distribution; lognormal: lognormal distribution; cluster_inds: the model with random cluster 
effects but without the cluster size predictor; knowsizes: the model with known population cluster sizes. RMSE, root mean squared error 
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FIGURE7 Density plot of 100 cluster sizes drawn from a Poisson distribution with rate 500 (Pois), a Gamma/multinomial distribution 
(Multi) with a multinomial draw from Gamma(10,1)-distributed samples multiplied by 100, and the Fragile Families (FF) study design. 
The x-axis is on the original scale in the left plot and the log10 scale in the right [Colour figure can be viewed at wileyonlinelibrary.com] 


ter sizes in the Fragile Families study are highly skewed. Hence, in the application, the negative binomial size-biased 
distribution assumption is not appropriate to depict the cluster size population with poor performance. The performance 
is accessed by comparing the predictive density distribution of the nonsampled cluster sizes in Figure 8. We collect the 
posterior mean estimates of the nonsampled cluster sizes N;'s under the parametric and nonparametric approaches, 


wl Wy LEY->tatistics MAKELA ET AL. 
In Medicine 


Outcome: continuous Outcome: binary 


6e-04 0.00075; 


> 
= 4e-04 = 0.000504 
5 5 
a a 

2e-04 0.00025 | 

Oe+00 0.00000 } 

0 5000 10000 15000 20000 0 10000 20000 30000 
Cluster size Cluster size 


Model — bb — lognormal — negbin— true 


FIGURE 8 Predictive density comparison for nonsampled cluster sizes Nj in the Fragile Families study design. negbin: negative binomial 
size-biased distribution; lognormal: lognormal size-biased distribution; bb: Bayesian bootstrap; t rue: true distribution [Colour figure can 
be viewed at wileyonlinelibrary.com] 


in contrast with the true density. The Bayesian bootstrap method avoids the parametric assumption and yields robust 
inference, and the lognormal distribution as the size-biased choice is able to capture the skewness and performs compet- 
itively, which is also demonstrated in Figure 5 and Figure 6. We can modify the parametric assumptions and improve the 
inference with suitable prior knowledge. 


5 | DISCUSSION 


We propose an integrated Bayesian model for the finite population inference from a two-stage cluster sample under 
PPS. Two-stage cluster sampling is popular across health surveys, however, the corresponding model-based inference has 
methodology challenges. Our method combines predicting measures of size for nonsampled clusters with estimation for 
the population mean into a single approach that propagates uncertainty from the two steps. We consider both the para- 
metric and nonparametric methods for modeling cluster sizes. The parametric models directly account for the unequal 
probabilities of selection by using the closed-form size-biased version of the underlying population distribution, whereas 
the nonparametric Bayesian bootstrap draws from the observed cluster sizes with probabilities that are weighted by the 
odds of that cluster not being selected. 

While design-based approaches are common in survey inference, variance estimation is often challenging. Current esti- 
mation approaches include theoretical approximations’ and resampling methods.” In contrast, our integrated approach 
yields the posterior distribution for the quantities of interest about the finite population, from which variances, uncer- 
tainty intervals, and any other functions can easily be computed. The proposal accounts for the design features in modeling 
and yields inference that is consistent with design-based approaches. 

The Bayesian methods generally outperform the design-based estimators and improve inference stability, particularly 
when the number of sampled clusters is small. The performance of the parametric methods negbin and lognormal 
is comparable to that of the nonparametric Bayesian bootstrap. When extra information about the population cluster 
sizes is available, for example, from previous years or similar groups, we can incorporate through the informative prior 
information. Moreover, the parametric methods are straightforward to implement in Stan, which makes them accessible to 
researchers whose expertise is in areas outside of statistics or programming. The results for parametric and nonparametric 
methods are more similar when J; = 50 than when J; = 10 in many of the scenarios our simulation study considered. 
The parametric method is subject to model misspecification especially under small sample. We recommend using the 
parametric methods as an initial step and perform model diagnostics to select those robust against misspecification. An 
important diagnostic measure is to check whether the population cluster sizes are highly skewed, as in the case of the 
Fragile Families setup shown in Figure 7. Thus, reasonable prior knowledge of the population distribution of cluster sizes 
should guide the model choice of the parametric or nonparametric approach. 

In our study, under binary y, the Bayesian methods were less clearly superior to classical methods in estimating the 
finite population proportion. One possible reason is that few auxiliary or predictive variables are included in the model. 
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However, when the cluster sizes are highly skewed, as in the Fragile Families case, Bayesian methods perform significantly 
better, in terms of lower bias and more reasonable coverage, than the classical estimators. 

There are several interesting directions in which the current research could be extended. First, our simulation has not 
considered the case where M; # Nj; in depth. The natural next step would be to extend the Fragile Families simulation 
to include the case where the measure of size Mj is the city population, but the cluster size N; itself is the total number 
of births in the city. In doing so, we must make some additional assumptions. So far, we have assumed that we know 
M,; only for the sampled clusters, but what about N;? If both Mj and Nj are only available for sampled clusters, we shall 
predict both M; and Nj for the entire population. One idea is to assume that Nj is a function of M; and use regression 
models to predict Nj, given Mj, perhaps the on the log scale to avoid predicting negative cluster sizes and difficulties with 
cluster sizes ranging over several orders of magnitude. In the Fragile Families study, the correlation between the log of 
city population Mj and log of total births N; is 0.78, so this seems like a promising strategy. 

Second, the outcome model can be extended with flexible modeling strategies. To focus on the evaluation of different 
approaches to account for the design effect and predict the nonsampled cluster sizes, for the outcome model, the esti- 
mation model we use is the same as the data generation model. In practice, we recommend outcome modeling that is 
robust against misspecification. Flexible models in the literature can be explored, such as heteroscedasticity assumption, 
penalized spline regression models, and nonparametric Bayesian models. The multilevel models stabilize estimation via 
smoothing across clusters. The partial pooling effect can be strengthened with the generalized covariance structure, eg, 
covariance kernel functions in Gaussian process regression models. 

Another direction would be to consider a stratified PPS design as in the original Fragile Families study design. This 
extension introduces another challenge in that we would need to adjust for the strata structure in our model. For the 
parametric cluster size models, we would need to partially pool the size parameters (eg, w, @ in the negative binomial 
model and y, o in the lognormal) across strata, adding another layer of complexity to the model. 

Bayesian approaches are well equipped to account for the design features in the survey data under complex sampling 
design through hierarchical modeling. Computational software development, such as the use of Stan, makes modeling 
approaches enhance the advantage. More methodology developments are necessary to incorporate additional information 
about the sampling into modeling, such as known population size, paradata, and auxiliary variables. 
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